Preliminary Program – ASPLOS 2025
The following content is excerpted from ASPLOS 2025 | Awesome Papers (copyright remains with the original compilation); I have extracted the papers of interest.
LLM Inference
- Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow [arXiv] [Code]
- CMU
- Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management
- Korea University
- COMET: Towards Practical W4A4KV4 LLMs Serving
- ICT, CAS
- Past-Future Scheduler for LLM Serving under SLA Guarantees
- Beihang University & SenseTime
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
- UW & MSR India
- Medusa: Accelerating Serverless LLM Inference with Materialization
- THU
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [arXiv]
- MSR India & IISc
- TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
- UIUC & Microsoft Azure Research
- PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
- ICT, CAS & ETH & UofT & NVIDIA
- PIM is All You Need: A CXL-Enabled GPU-Free System for LLM Inference
- UMich & ETH & Google
- Fast On-device LLM Inference with NPUs
- PKU & BUPT
LLM-based Applications
- Towards End-to-End Optimization of LLM-based Applications with Ayo
- CUHK
MoE Inference
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
- UC Berkeley
- Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
- SYSU & HKUST & Huawei & Peng Cheng Laboratory
Retrieval-Augmented Generation (RAG)
- Accelerating Retrieval-Augmented Generation
- Cornell & Kansas & UMass Amherst & Samsung Electronics
Resource Management
- Shared ML Clusters
- Design and Operation of Shared Machine Learning Clusters on Campus
- HKUST
- Resource Oversubscription
- Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms
- Microsoft
- Serverless Computing
- Litmus: Fair Pricing for Serverless Computing
- Binghamton & Intel Lab
- Concurrency-Informed Orchestration for Serverless Functions
- UVA & Alibaba & Amazon
- Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
- ICT, CAS
- Graceful Degradation
- Cooperative Graceful Degradation in Containerized Clouds [arXiv]
- UC Irvine
- Microservice
- Embracing Imbalance: Dynamic Load Shifting among Microservice Containers in Shared Clusters
- University of Macau
- GPU Sharing
- Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads [arXiv]
- Stanford & UofT